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Abstract 

We give a systolic algorithm and array for bidiagonalization of an n x n matrix 
in 0(nlog 2 n) time, using 0(n 2 ) cells. Bandedness of the input matrix may 
be effectively exploited. If the matrix is banded, with p nonzero sub diagonals 
and q nonzero superdiagonals, then 4 n ln(p 4 q) 4 0(n) clocks and 2 n(p 4 
q) 4 0((p+ q) 2 + n) cells are needed. This is faster than the best previously 
reported result by the factor log 2 e = 1.44 • • •. Moreover, in contrast to earlier 
systolic designs, which require the matrix to be preloaded into the array and 
the result matrix extracted after bidiagonalization, the present arrays are 
pipelined. 
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Abstract 

We give a systolic algorithm and array for bidiagonalization of an n x n matrix 
in 0(n logj n ) time, using 0(n 2 ) cells. Bandedness of the input matrix may 
be effectively exploited. If the matrix is banded, with p nonzero subdiagonals 
and q nonzero superdiagonals, then 4n In (p + q) + O(n) clocks and 2n(p + 
q'j 0((p+ q) 2 + n) c ells are needed. This is faster than the best previously 
reported result by the factor log 2 e = 1.44 • • •• Moreover, in contrast to earlier 
systolic designs, which require the matrix to be preloaded into the array and 
the result matrix extracted after bidiagonalization, the present arrays are 
pipelined. 

1 Introduction 

In this paper we present a group of new algorithms, and their implementation 
by systolic arrays, for computing the factorization 

B = PAQ (1) 
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where A is a given mxn matrix, B is an n x n upper bidiagonai matrix, Q 
is orthogonal and P has orthonormal rows. The method has a straightforward 
analog for the factorization 

T=PAP t (2) 

where A is a given symmetric n x n matrix, T is a symmetric tridiagonal 
matrix, ajicLP is orthogonal. These factorizations are the essential first steps 
of important algorithms for computing the singular value decomposition and 
the symmetric eigendecomposition [6j. 

To compute these factorizations with systolic arrays is a long*3tanding 
problem. Johnsson [7] developed an implementation of the Householder 
method that requires 0(n 2 ) time. All later work has examined algorithms 
that use plane rotations. Kung and Gal-Ezar gave a tridiagonalization method 
that requires 0(n 2 ) time [8]. It was felt by some that this was the optimal 
time complexity for systolic tridiagonalization; and it was later proved by 
Carlsson, et.al., that any algorithm that zeros the off- tridiagonal matrix el- 
ements in a column-by-column fashion and does not introduce a nonzero 
into a position already zeroed must take 0(n 2 ) time on a systolic array [3]. 
Other algorithms, which do some extra work by allowing fill-in of positions 
already zeroed, have better systolic possibilities. Schreiber [12] gave a pair 
of arrays for tridiagonalization in 0(n 3/2 ) time using 0(n 3 ^) processors. 
Finally, Bojanczyk and Brent [l] gave a systolic tridiagonalization method 
using 4n log 2 n + O(n) time and n 2 processors. 

Ipsen [9] presented some systolic implementations of a new bidiagonaliza- 
tion algorithm, without obtaining an 0(n log n) implementation. This paper 
is largely a refinement of her ideas. We shall present new systolic designs 
that, like those of [1], require O(nlogn) time and 0(n 2 ) hardware. But the 
approach is quite different from that of [l]. It is faster. Our array can reduce 
an upper triangular matrix with m nonzero superdiagdnals to bidiagonai form 
in 4nlnm clocks, while the Bojanczyk/Brent array requires 4n log 2 m, which 
is more by a factor of log 2 « a 1.44 • • *. It is a pipelined design in which the 
matrix flows into the array and the resulting bidiagonai matrix flows out in 
a uniform manner, while for the Bojanczyk/Brent design the matrix must be 
preloaded and the result matrix extracted. Finally, an important advantage 
of the technique described here is that it is fairly simple to understand and 
describe. _ 

After computing either of the*factorizations (1} or (2), it is still necessary 
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to use an iterative technique to find the singular values of B. Schreiber has 
described a systolic array method for the latter problem that provides all the 
singular values and requires O(n) time and 0(n) processors [12]. 

In Section 2 we introduce an array for band matrix QR factorization that 
we use as a building block to construct arrays. The bidiagonalization array 
is described in Section 3. 


1.1 Notation 

We use upper case letters for matrices and the matching lower case letters 
for their elements. We say that A is ( p , <?)-banded if 


j < i — p =*» a,j = 0 


and 


j > i + q => dij = 0. 

We call the pair (p, q) the half-bandwidth of A and write hbw(A) = (p, q). 


2 The Band-QR Array 

In this section we describe a simple systolic array that we shall use as a tool, 
an array for QR factorization of a banded, square matrix, that was originally 
introduced by Heller and Ipsen [10]. Consider the processor array of Figure 
2 . 

The array is synchronous; all processors share a common clock. One clock 
period ("clock”) is the time needed by a cell to apply a plane rotation. 

Matrix elements not explicitly shown are all zero, and plane rotations 
not explicitly shown are all identity rotations. Elements of the ( p , g)-banded 
matrix A enter at the bottom of the array; element a,-,- enters cell j — i + p + 1 
at time i + j — 2. The array applies plane rotations to the rows of A in order 
to zero all elements entering cell 1 — the leftmost subdiagonal of A. Thus, 
the array computes a (p — 1,^ + 1) banded matrix Ai satisfying A\ = PA , 
where P is orthogonal. The (ij) th element of A\ leaves cell j — i -f p at time 
i + j; thus the transit time through the array is two clocks. Note that the 
rightmost ( q + l) st diagonal, which leaves from the cell at the right end of 
the array, must have at least p — 1 leading zeros. 
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Figure 1: The band-Qi? array 

Suppose we cascade r such arrays, where r < p. The output of the last 
array is a matrix 

A r = PA (3) 

such that hbw(A r ) = (p — r,q + r) and P is orthogonal. Furthermore, the 
diagonals that leave from the rightmost r cells all have at least p—r leading 
zeros. 

Note too that if r < q, then by introducing A T into the array we may 
compute 

A r = AQ (4) 

where hbw(A r ) = (p+ r, q — r), Q is orthogonal, and the leftmost r diagonals 
of A r have at least q — r leading zeros. It may be inconvenient to transpose a 
matrix; fortunately, it is equivalent to send A into an array that is the mirror 
image of the one in Figure 2. 

Finally, suppose that each of the leftmost r subdiagonals of A has s 
leading zeros. Then the r new superdiagonals created in the factorization 
(3) must have at least s r = s + p - r leading zeros. The corresponding 
relation holds for the factorization (4). The algorithm described in the next 
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section is based on the use of a sequence of factorizations of types (3) and 
(4) that in this way introduce more and more zeros at the top of a group of 
diagonals. 

3 Bidiagonalization by Systolic Array 

We shall first describe the new bidiagonalization algorithm, then its sys- 
tolic implementation. The algorithm we describe works for upper triangular, 
square matrices. If A is m x n, then it would first be necessary to upper- 
triangularize A by an orthogonal transformation. After factoring 

A = P*R (5) 

where R is upper triangular and Pq has orthonormal rows, we would bidiag- 
onalize R , giving the factorization 

B = PrRQ. (6) 

Then 


B = PrPoAQ 
= PAQ 

is the desired factorization (1). Systolic arrays for (5) are well-known [2,4, 
10,13]. 

The algorithm proceeds by reducing the bandwidth of R by stages until 
R becomes bidiagonal. At each stage, as few as one and as many as half 
of the excess superdiagonals to the right of the first superdiagonal may be 
eliminated. Suppose hbw(i?) = (0,m). Let 


m = ri + r 2 -I f r# + 1 (7) 

where each rjt is a positive integer such that, for A; = 1,2,... ,N , 

fk < rfc + i -| 1- rs + 1 


= s k . 
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At the k th stage, r k superdiagonals are eliminated and s k remain. In all, 
N stages are needed. The algorithm generates a sequence {i2jt} of upper 
triangular matrices such that 

R 0 = R) 


and 

* Rk = PkRk-lQk, ( 8 ) 

where P k and Q k are orthogonal and h.bw(R k ) = (0, s k ). Thus, B — Rn 
is upper bidiagonal ( s n = 1) and satisfies (6) with Pr = rU=;v-Pfc and 

We now show how diagonals of R are removed and its bandwidth reduced. 
The technique used was first employed by Ipsen [9]. It is easier to look at a 
particular case first. Suppose that R = Rq has hbw(i?) = (0, 5), so there axe 
four diagonals to be removed. Let rq = 2; thus s x = 3. Therefore, in the first 
stage, we want to remove the two outermost superdiagonals of Rq and leave 
three superdiagonals, thus creating R\ such that hbw(Rx) = (0,3). This is 
done as follows. 

First remove the two rightmost diagonals of Rq with a 2 x 6 band-QR 
array. Thus: 



*•3 r-2 



where S\ = RqV\. Clearly, hbw(Si) = (2,3) and the two subdiagonals of S\ 
have three leading zeros. Next, remove the two subdiagonals from S\ with a 
2x6 band-Qi? array. Thus: 
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where 

Ro ] = Ux Sj = U l RV 1 . 

Clearly, hbw(i?Q 1 ^) = (0,5), and its two outer superdiagonals have three 
leading zeros. Continuing, we have 

R ( 0 j) = U j ---U 1 RV 1 ---V j , j = 0,1,... 

where hbw(i?o4 = (0, 5) and the outer two superdiagonals have 3 j leading 
zeros. Let Ri = Rq?\ where J — |*(n — 4)/3]. Clearly R\ is upper triangular 
and the fourth and fifth superdiagonals of i?i have at least 3 J >= n — 4 
leading zeros. Thus they are entirely zero, and hbw(i?i) = (0, 3) as we wished. 
In general, 

Rk = 4 - 1 } 

= Uj — lhRk-tVf-Vj 
= PkRk-iQk 

where J = J(k) = [(n— (sfc + l))/sjt]. Each matrix Vj zeros r* superdiagonals 
of R)?~i creating 

Si = r£ (9) 

which has r*. subdiagonals having jsk leading zeros. Next, Uj removes the 
subdiagonals from Sj, giving 

Ri j h = UjSj, (io) 

which has jsk leading zeros in its n, outermost superdiagonals. Thus, J(k) 
steps suffice to remove all of these superdiagonals. 

The problem of bidiagonalizing a banded, upper triangular matrix has, 
of course, been studied before. The original work is Rutishauser’s [11]. An 
analysis of several schemes was given by Golub, Luk, and Overton [5]. The 
scheme employed by Bojanczyk and Brent is a hybrid of the schemes Band 
Givens I and Band Givens II of [5]. An operation count for the present 
algorithm, summarized in Table 3, shows that it uses about twice as many 
operations as the best schemes discussed by Golub, Luk, and Overton. The 
first column of the table gives operation counts for algorithms that compute 
only B. The second column is for algorithms that accumulate the rotations 
in order to compute both Pr and Q. 
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Algorithm 

Reduction without 
vectors 

Reduction with 
vectors 

systolic 

(present paper) 

~ 8 mn 2 

~ 8n 3 In m 

Band Givens I [5] 

~ 4mn 2 

~ 4n 3 

Band Givens II [5] 

~ 4 mn 2 

~ 4n 3 In m 

Band Givens III [5] 

~ 4mn 2 

8n 3 



3 

Bojanczyk/Brent [1] 

~ 4mn 2 

~ 3.88n 3 lnm 


Table 1: Asymptotic multiplication counts for bidiagonalization of a matrix 
of order n with half- bandwidth (0, m) 


k 

rk 

Sk 

m 


1 

4 

6 

3 

24 

2 

3 

3 

6 

36 

3 

1 

2 

9 

18 

4 

1 

1 

18 

36 


t(k ) is the number of rows in the processor array; t(k) = 2r k J(k). 


Table 2: Bandwidth reduction; n = 20, m = 10, first partition 

3.1 The systolic bidiagonalization array 

The systolic array that carries out this algorithm will now be described. The 
process (9) can be accomplished by an r k x (1 + s k + r k ) band-Q-R array 
for removing superdiagonals by column rotations, as described in Section 2. 
The process (10) can be accomplished by a matching r k x (1 + s k + r k ) band- 
QR array for removing subdiagonals by row rotations; the two arrays form a 
single 2r k x (1 + s k + r k ) array, the output of one array becoming the input 
to the other. Thus, a sequence of J(k) pairs of band-Qi? arrays can carry 
out one full stage in the reduction of bandwidth (8); see Figure 2. 

The whole bidiagonalization process is accomplished by a sequence of such 
arrays. Again, to make it clear, we will look at a particular case. Suppose 
that R = R 0 has half bandwidth (0,10). The parameters of the algorithm 
are given in Table 2 and the array is illustrated in Figure 3. 

For the same problem, a different choice of partition (7) leads to a different 
bidiagonalization process. An alternate choice for the problem above leads 
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Figure 2: The bandwidth reduction array; r=2, s=3, n=10, J=2 
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Figure 3: Bidiagonalization array for first partition 
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k 

r k 


J(k) 

m 

1 

1 

9 

2 

4 

2 

1 

8 

2 

4 

3 

1 

7 

2 

4 

4 

1 

6 

3 

6 

5 

1 

5 

3 

6 

6 

1 

4 

4 

8 

7 

1 

3 

6 

12 

8 

1 

2 

9 

18 

9 

1 

1 

18 

36 


t(k) is the number of rows in the processor array; t(k) = 2 TkJ(k). 


Table 3: Bandwidth reduction; n = 20, m = 10, second partition 


to a somewhat more efficient scheme; the parameters are given in Table 3 
and an illustration in Figure 4 While 696 processing elements are needed for 
the array of Figure 3, only 498 processing elements are needed for the array 
of Figure 4. Moreover, the time required drops from 228 to 196 clocks. 

Since the transit time through a band-Qi? array having r rows is 2r 
clocks, it follows that 4 r* clocks are needed for a matrix to traverse a single 
stage in Figure 2, that 4 r k J(k) clocks are needed to traverse the J(k) stages 
that carry out the k th reduction stage (8), and therefore that the total time 
required to traverse the array is 


T(n,m,WL,) 


N 


^2ir k J(k) 

1 


N 


£ 4r * 

ik=l 


r n 


sk - r 

Sk 


To simplify the analysis we consider instead 


T x 





4 
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array for second partition 


2 


Lemma 1 T\ is minimized by choosing N = m — 1 and = 1 for k — 


Proof: Choose any partition (7) of m. Suppose > 1. Replace rjt by a s 

many ones. This changes the term 


tk 


into the sum 


t = jr ( 0—1— _ i) 


< tk 


This process can continue until all r*, = 1. 


QED 


With this choice we have that 

T w 4Ti = 4 ((n - - (m - 1)) 

where H m _i, the m — 1 st harmonic number, satisfies 

H m - i « 7 + ln(m - 1) + 1 tt + 0(m~ 2 ) 

z{rn — 1 ) 

and 7, Euler’s constant, is .577 • • •. Thus, 

T sa 4n In m + 2.31n — 4m + o(n + m). 

An alternative, which maximizes T, is to choose r to be about half the 
remaining bandwidth at each step. In that case, N « log 2 m, and T « 
4(nlog 2 m — m). Thus the difference between the optimal and the worst 
choice of partition is, asymptotically, only a factor of log 2 e = 1.44 By 
comparison, the Bojanczyk/Brent array requires 4n log 2 m clocks; thus it is 
slower by the factor log 2 e than our best array. 
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The space requirement of our method is 

P = P(n, m, {r k }% =1 ) = £ *(*)(«* + r* + 1) 

Jfe=i 

cells. For the choice r k = l, N = m — 1 we have 

P „ Ey n ~p ' )(**+ 2) 

fe=l ' s k S 

= 2nro + 0(n lnm + m 2 ). 

Thus, about twice as many processors are needed as for the Bojanczyk/Brent 
array. 

If the matrices P and Q are needed, then additional computation must 
be done to accumulate the product of all the row rotations (which move left 
to right in the array) that is P , and the product of the column rotation 
(which move right to left) that is Q. ATxn array can accumulate the 
row rotations. The rotations enter at the left edge of the array in the format 
created by a band-Qi? array. The nxn identity matrix enters at the bottom, 
one column per processor. The array applies the rotations to this matrix, 
and their product leaves from the top. Obviously, there is also an array of 
this type for accumulating column rotations. 

3.2 Systolic tridiagonalization 

We will briefly describe the generalization of our bidiagonalization method 
to the symmetric tridiagonalization problem. Let A by a symmetric nxn 
matrix with hbw(A) = (m, m). Using an r x 2m + 1 band-QiE array, we 
compute the factorization 

B\ — UiA , (11) 

where hbw(5j) = (m— r, m+r) and the r newly created outer superdiagonals 
of Bi each has $ = m — r leading zeros. Next form the symmetric matrix 

A (1) = B X U{. (12) 

It is straightforward to show that has only m nonzero subdiagonals, so 
by its symmetry, hbw(A^) = (m,m); furthermore, the outermost r sub- 
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Figure 5: The tridiagonalization-step array, m = 3 ; r = l; n=:10. 


aad superdiagonals have s leading zeros. Thus, we have an analog of the 
bidiagonalization algorithm, with the same time and space req uirem ents. 
The only essential difference is that in each pass over the matrix, only one 
orthogonal matrix is generated (Ui above). It is applied from both the left 
and the right to the input matrix. This can be accomplished by a systolic 
3rray of 2 r rows, in which the first r rows carray out the factorization (II) 
and the following r rows form the product (12). To do this, plane rotations 
generated by ceils at the left edge of the first group of rows are sent, via long 
connections, to the right edge of the second group of rows. An illustration of 
the array is given in Figure 3.2. , ; — - - 
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